Hierarchical multimodal transformer to summarize videos

Abstract

Although video summarization has achieved tremendous success benefiting from Recurrent Neural Networks (RNN), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits their performance. The transformer is an effective model for this problem and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation and video captioning. Motivated by the great success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization, which can capture the dependencies among frames and shots, and summarize the video by exploiting the scene information formed by shots. Furthermore, we argue that both audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted Hierarchical Multimodal Transformer (HMT). Practically, extensive experiments show that HMT achieves (F-measure: 0.441, Kendall's τ: 0.079, Spearman's ρ: 0.080) and (F-measure: 0.601, Kendall's τ: 0.096, Spearman's ρ: 0.107) on SumMe and TVsum, respectively. It surpasses most of the traditional, RNN-based and attention-based video summarization methods.
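The two rank-correlation figures quoted in the abstract compare how a model orders frames by importance with how human annotators order them. The sketch below shows how such coefficients can be computed in plain Python. It is an illustration only, not the authors' evaluation code: the function names, the toy scores, and the choice of the tau-a variant of Kendall's coefficient are assumptions, and the abstract does not say how scores are aggregated across annotators or videos.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant (assumed variant).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: hypothetical per-frame importance scores.
predicted = [0.9, 0.1, 0.5, 0.3]
human = [0.8, 0.2, 0.3, 0.4]
tau = kendall_tau(predicted, human)
rho = spearman_rho(predicted, human)
```

Both coefficients range from -1 (reversed ordering) to 1 (identical ordering), with values near 0 indicating little agreement in ordering.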

Similar resources

Learning to score and summarize figure skating sport videos

This paper focuses on fully understanding figure skating sport videos. In particular, we present a large-scale figure skating sport video dataset, which includes 500 figure skating videos. On average, the length of each video is 2 minutes and 50 seconds. Each video is annotated by three scores from nine different referees, i.e., Total Element Score (TES), Total Program Component Score (PCS), a...

For Your Eyes Only: Learning to Summarize First-Person Videos

With the increasing amount of video data, it is desirable to highlight or summarize the videos of interest for viewing, search, or storage purposes. However, existing summarization approaches are typically trained from third-person videos, which cannot generalize to highlight the first-person ones. By advancing deep learning techniques, we propose a unique network architecture for transferring ...

Hierarchical Spatial Transformer Network

Computer vision researchers have long hoped that neural networks would acquire a spatial transformation ability to eliminate the interference caused by geometric distortion. The emergence of the spatial transformer network made this dream come true. The spatial transformer network and its variants can handle global displacement well, but lack the ability to deal with local spatial variance. Hence how ...

A Hierarchical Approach to Multimodal Classification

Data models that are induced in classifier construction often consist of multiple parts, each of which explains part of the data. Classification methods for such models are called multimodal classification methods. The model parts may overlap or have insufficient coverage. How best to deal with the problems of overlapping and insufficient coverage? In this paper we propose hierarchical or ...

Multimodal Location Estimation of Videos and Images

Reading is a hobby to open the knowledge windows. Besides, it can provide the inspiration and spirit to face this life. By this way, concomitant with the technology development, many companies serve the e-book or book in soft file. The system of this book of course will be much easier. No worry to forget bringing the multimodal location estimation of videos and images book. You can open the dev...

Journal

Journal title: Neurocomputing

Year: 2022

ISSN: 0925-2312, 1872-8286

DOI: https://doi.org/10.1016/j.neucom.2021.10.039